Abstract

Recently several authors have proposed stochastic evolutionary models for the growth 
of complex networks that give rise to power-law distributions. These models are based 
on the notion of preferential attachment leading to the "rich get richer" phenomenon. 
Despite the generality of the proposed stochastic models, there are still some unexplained 
phenomena, which may arise due to the limited size of networks such as protein, e-mail, 
actor and collaboration networks. Such networks may in fact exhibit an exponential 
cutoff in the power-law scaling, although this cutoff may only be observable in the tail 
of the distribution for extremely large networks. We propose a modification of the basic 
stochastic evolutionary model, so that after a node is chosen preferentially, say according 
to the number of its inlinks, there is a small probability that this node will become 
inactive. We show that as a result of this modification, by viewing the stochastic process 
in terms of an urn transfer model, we obtain a power-law distribution with an exponential 
cutoff. Unlike many other models, the current model can capture instances where the 
exponent of the distribution is less than or equal to two. As a proof of concept, we 
demonstrate the consistency of our model empirically by analysing the Mathematical 
Research collaboration network, the distribution of which is known to follow a power law 
with an exponential cutoff. 

1 Introduction 

Power-law distributions taking the form 

$$f(i) = C\, i^{-\tau}, \tag{1}$$

where $C$ and $\tau$ are positive constants, are abundant in nature [Sor00]. The constant $\tau$ is called
the exponent of the distribution. Examples of such distributions are: Zipf's law, which states
that the relative frequency of words in a text is inversely proportional to their rank, Pareto's
law, which states that the number of people whose personal income is above a certain level 
follows a power-law distribution with an exponent between 1.5 and 2 (Pareto's law is also 
known as the 80:20 law, stating that about 20% of the population earn 80% of the income) and 
Gutenberg-Richter's law, which states that over a period of time, the number of earthquakes 
of a certain magnitude is roughly inversely proportional to the magnitude. Recently, several 
researchers have detected power-law distributions in the topology of several networks such as 
the World-Wide-Web [BKM+00, KRR+00] and author citation graphs [Red98].

The motivation for the current research is two-fold. First, from a complex network
perspective, we would like to understand the stochastic mechanisms that govern the growth of a
network. This has led to fruitful interdisciplinary research by a mixture of Computer
Scientists, Mathematicians, Statisticians, Physicists, and Social Scientists [AB02, DM00, KRL00,
LFLW02, New01, PFL+02], who are actively involved in investigating various characteristics
of complex networks such as the degree distribution of the nodes, the diameter, and the
relative sizes of various components. These researchers have proposed several stochastic models
for the evolution of complex networks; all of these have the common theme of preferential
attachment, which results in the "rich get richer" phenomenon: for example, new
links to existing nodes are added in proportion to the number of links to these nodes currently
present.

An extension of the preferential attachment model, proposed in [DM00], takes into account
the ageing of nodes so that a link is connected to an old node, not only preferentially, but also 
depending on the age of the node: the older the node is the less likely it is that other nodes will 
be connected to it. It was shown in [DM00] that if the ageing function is a power law then the
degree distribution has a phase transition from a power-law distribution, when the exponent 
of the ageing function is less than one, to an exponential distribution, when the exponent 
is greater than one. A different model of node ageing was proposed in [ASBS00] with two
alternative ageing functions. With the first function the time a node remains 'active', i.e. may 
acquire new links, decays exponentially, and with the second function a node remains active 
until it has acquired a maximum number of links. Both functions were shown by simulation 
to lead to an exponential cutoff in the degree distribution, and for strong enough constraints 
the distribution appeared to be purely exponential. Another explanation of the cutoff, offered 
in [MBSA02], is that when a link is created the author of the link has limited information
processing capabilities and thus only considers linking to a fraction of the existing nodes,
those that appear to be "interesting". It was shown by simulation that when the fraction of
"interesting nodes" is less than one there is a change from a power-law distribution to one 
that exhibits an exponential cutoff, leading eventually to an exponential distribution when 
the fraction is much less than one. 

Second, a motivation for this research is that the viability and efficiency of network
algorithmics are affected by the statistical distributions that govern the network's structure. For
example, the discovered power-law distributions in the web have recently found applications
in local search strategies in web graphs [ALPH01], compression of web graphs [AM01] and an
analysis of the robustness of networks against error and attack [AJB00, JMBO01].

Despite the generality of the proposed stochastic models for the evolution of complex 
networks, there are still some unexplained phenomena; these may arise due to the limited size 
of networks such as protein, e-mail, actor and collaboration networks. Such networks may in 
fact exhibit an exponential cutoff in the power-law scaling, although this cutoff may only be 
observable in the tail of the distribution for extremely large networks. The exponential cutoff 
is of the form 

$$f(i) = C\, i^{-\tau} q^{i}, \tag{2}$$

with $0 < q < 1$. The exponent $\tau$ in (2) will be smaller than the exponent that would be
obtained if we tried to fit to the data a power law without a cutoff, like (1). Unlike many
other models leading to power-law distributions, models with a cutoff can capture situations
in which the exponent of the distribution is less than or equal to two, which would otherwise
have infinite expectation.
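
As a rough numerical illustration of this last point (a sketch, not part of the original analysis), the following Python snippet compares partial sums of $\sum_i i f(i)$ for the pure power law (1) and for the cutoff form (2). The choices $C = 1$, $\tau = 1.5$ and $q = 0.99$ are assumptions made only for this example, and the function name `partial_mean` is hypothetical.

```python
def partial_mean(tau, q, n_terms):
    """Partial sum of i * f(i) for f(i) = i**(-tau) * q**i (taking C = 1)."""
    total = 0.0
    for i in range(1, n_terms + 1):
        total += i * i ** (-tau) * q ** i
    return total


# Illustrative values only (not taken from the MR data set).
tau = 1.5
for n in (10_000, 40_000):
    pure = partial_mean(tau, 1.0, n)     # power law (1): keeps growing with n
    cutoff = partial_mean(tau, 0.99, n)  # cutoff form (2): settles to a limit
    print(n, round(pure, 2), round(cutoff, 6))
```

With $q = 1$ the partial sums keep growing as more terms are included, whereas with $q < 1$ the factor $q^i$ makes the tail negligible and the sum stabilises.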

An exponential cutoff has been observed in protein networks [JMBO01], in e-mail networks
[EMB02], in actor networks [ASBS00], in collaboration networks [New01, Gro02], and is
apparently also present in the distribution of inlinks in the web graph [MBSA02], where a
cutoff had not previously been observed. We believe it is likely, in many such cases where
power-law distributions have been observed, that better models would be obtained with an
exponential cutoff like (2), with $q$ very close to one.

The main aim of this paper is to provide a stochastic evolutionary model for a class of 
networks like collaboration networks that result in asymptotic power-law distributions with an 
exponential cutoff. This model also enables us to explain some phenomena where the exponent 
is less than or equal to two. As with many of these stochastic growth models, the ideas 
originated from Simon's visionary paper published in 1955 [Sim55]. At the very beginning
of his paper, in equation (1.1), Simon observed that the class of distribution functions he
was about to analyse can be approximated by a distribution like (2); he called the term $q^i$
the convergence factor and suggested that $q$ is close to one. He then went on to present his
well-known model that yields power-law distributions like (1), and which has provided the
basis for the models rediscovered over 40 years later. Simon gave no explanation for the 
appearance, in practice, of the convergence factor. 

In a previous paper [FLL05a], we dealt with a related class of networks that exhibit an
exponential cutoff, such as protein interaction networks, in which after a protein is chosen 
preferentially, say according to the number of other proteins it interacts with, there is a small 
probability that this protein is discarded from the network. E-mail networks and the web 
graph are further examples belonging to this class of network. However, in this paper we 
consider other networks that behave differently, such as collaboration and actor networks. 
Consider a collaboration network: after an author is chosen preferentially, according to the 
number of collaborators he/she currently has, there is a small probability that this author 
will become inactive, but he/she will not be removed from the network. Inactive authors 
do not start new collaborations but their existing collaborations still persist in the network. 
Possible reasons for inactivity may be the finite time window of the data used or because an 
author retires from collaborative writing. 

The rest of the paper is organised as follows. In Section 2 we present an urn transfer model
that extends Simon's model by allowing an author, chosen as described above, to sometimes
become inactive. We then derive the steady-state distribution of the model, which, as stated
earlier, follows an asymptotic power law with an exponential cutoff like (2). In Section 3 we
demonstrate that our model can provide an explanation of the empirical distributions found
in collaboration networks. Finally, in Section 4 we give our concluding remarks.

2 An Urn Transfer Model for Collaboration Networks 

We now briefly present an urn transfer model for a stochastic process that emulates the
situation where balls (which might represent authors) become inactive with a small probability,
but still remain in the system. We assume that a ball in the $i$th urn has $i$ pins attached to it
(which might represent the author's collaborations). The model is a variant of our previous
model of exponential cutoff [FLL05a], where balls are discarded with a small probability.

We assume a countable number of (unstarred) urns, $\mathit{urn}_1, \mathit{urn}_2, \mathit{urn}_3, \ldots$ and a
corresponding set of starred urns $\mathit{urn}_1^*, \mathit{urn}_2^*, \mathit{urn}_3^*, \ldots$, where the latter contain the inactive balls.
Initially all the urns are empty except $\mathit{urn}_1$, which has one ball in it. Let $F_i(k)$ and $F_i^*(k)$
be the number of balls in $\mathit{urn}_i$ and $\mathit{urn}_i^*$, respectively, at stage $k$ of the stochastic process,
so $F_1(1) = 1$, all other $F_i(1) = 0$ and all $F_i^*(1) = 0$. Then, at stage $k + 1$ of the stochastic
process, where $k \geq 1$, one of two things may occur:

(i) with probability $p$, $0 < p < 1$, a new ball (with one pin attached) is inserted into $\mathit{urn}_1$,
or

(ii) with probability $1 - p$ an urn is selected, with $\mathit{urn}_i$ being selected with probability
proportional to $i F_i(k)$, the number of pins it contains (or attached to it), and a ball is
chosen from the selected urn, $\mathit{urn}_i$ say; then,

(a) with probability $q$, $0 < q < 1$, the chosen ball is transferred to $\mathit{urn}_{i+1}$ (this is
equivalent to attaching an additional pin to the ball chosen from $\mathit{urn}_i$), or

(b) with probability $1 - q$ the ball chosen is transferred to $\mathit{urn}_i^*$ (this is equivalent to
making the ball inactive).

In terms of our model [FLL05a], this means that instead of discarding a ball from $\mathit{urn}_i$,
say, the ball is transferred into the corresponding starred urn, $\mathit{urn}_i^*$. A ball in a starred urn
takes no further part in the stochastic process, i.e. it does not acquire any further pins and so
never moves from its urn. In particular, balls in starred urns have no effect on the preferential
choices made during the continuation of the stochastic process.
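
The process described in steps (i) and (ii) can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the simulation code used for the results reported later: the function name `simulate_urns`, the dictionary representation of the urns and the fixed random seed are all hypothetical choices.

```python
import random


def simulate_urns(p, q, stages, seed=0):
    """Run the urn transfer process up to the given stage.

    Returns two dicts mapping urn index i to the number of balls:
    the unstarred (active) urns and the starred (inactive) urns.
    """
    rng = random.Random(seed)
    active = {1: 1}   # F_i(k): initially one ball in urn_1
    inactive = {}     # F_i^*(k): all starred urns start empty
    for _ in range(stages - 1):
        if not active:            # process terminates: no active balls left
            break
        if rng.random() < p:
            # Step (i): insert a new ball with one pin into urn_1.
            active[1] = active.get(1, 0) + 1
        else:
            # Step (ii): select urn_i with probability proportional to i * F_i(k).
            urns = list(active)
            weights = [i * active[i] for i in urns]
            i = rng.choices(urns, weights=weights)[0]
            active[i] -= 1
            if active[i] == 0:
                del active[i]
            if rng.random() < q:
                # Step (ii)(a): the ball gains a pin and moves to urn_{i+1}.
                active[i + 1] = active.get(i + 1, 0) + 1
            else:
                # Step (ii)(b): the ball becomes inactive and moves to urn_i^*.
                inactive[i] = inactive.get(i, 0) + 1
    return active, inactive


active, inactive = simulate_urns(p=0.3, q=0.95, stages=2000)
total_balls = sum(active.values()) + sum(inactive.values())
print(total_balls, sorted(inactive.items())[:3])
```

Selecting an urn by weight costs time proportional to the number of non-empty urns at each stage, which is adequate for a small illustration; a large-scale simulation would probably need a more efficient sampling scheme.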

We could modify the initial conditions so that, for example, $\mathit{urn}_1$ initially contained $\delta > 1$
balls instead of one. It can be shown that any change in the initial conditions will have no
effect on the asymptotic distribution of the balls in the urns as $k$ tends to infinity, provided
the process does not terminate with all of the unstarred urns empty.

In order for this not to occur it is necessary that, on average, more balls are added to the 
system than become inactive. To ensure this we require $p > (1 - p)(1 - q)$, see [FLL05a].
In practice this condition will nearly always hold, so from now on we assume this. This 
condition implies that the probability that the urn transfer process will not terminate with 
all the unstarred urns being empty is positive. 

More specifically, the probability of non-termination is $1 - \left( (1 - p)(1 - q)/p \right)^{\delta}$; this is just
the probability that the gambler's fortune will increase forever [Ros83].
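
The non-termination probability can be evaluated directly. The small helper below is a sketch; its name and its choice to raise an error when the condition $p > (1-p)(1-q)$ fails are our own assumptions, made because the formula is stated only under that condition.

```python
def non_termination_probability(p, q, delta=1):
    """Evaluate 1 - ((1 - p)(1 - q) / p)**delta for delta initial balls."""
    ratio = (1 - p) * (1 - q) / p
    if ratio >= 1:
        # The formula above is stated only under p > (1 - p)(1 - q).
        raise ValueError("requires p > (1 - p)(1 - q)")
    return 1 - ratio ** delta


print(non_termination_probability(0.5, 0.5, delta=2))
```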

Following Simon [Sim55], we now state the mean-field equations for the urn transfer model.
For $i > 1$ we have

$$E_k(F_i(k+1)) = F_i(k) + \beta_k \left( q\,(i-1)\,F_{i-1}(k) - i\,F_i(k) \right), \tag{3}$$

where $E_k(F_i(k+1))$ is the expected value of $F_i(k+1)$ given the state of the model at stage
$k$, and

$$\beta_k = \frac{1 - p}{\sum_{i=1}^{k} i\,F_i(k)} \tag{4}$$

is the normalising factor.

Equation (3) gives the expected number of balls in $\mathit{urn}_i$ at stage $k + 1$. This is equal to
the previous number of balls in $\mathit{urn}_i$ plus the probability of adding a ball to $\mathit{urn}_i$ from $\mathit{urn}_{i-1}$
in step (ii)(a) minus the probability of removing a ball from $\mathit{urn}_i$ in step (ii).

In the boundary case, $i = 1$, we have

$$E_k(F_1(k+1)) = F_1(k) + p - \beta_k F_1(k), \tag{5}$$

for the expected number of balls in $\mathit{urn}_1$, which is equal to the previous number of balls in
the first urn plus the probability of inserting a new ball into $\mathit{urn}_1$ in step (i) of the stochastic
process defined above minus the probability of removing a ball from $\mathit{urn}_1$ in step (ii).

For starred urns, for $i \geq 1$, corresponding to (3) and (5), we have

$$E_k(F_i^*(k+1)) = F_i^*(k) + (1 - q)\,\beta_k\, i\, F_i(k). \tag{6}$$
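
To make the bookkeeping in (3) to (6) concrete, the following sketch applies one mean-field update to lists of expected urn contents. It is an illustration only: the function name `mean_field_step` and the list layout (index `i - 1` holds the value for urn $i$) are assumptions of this example, and iterating the update treats the expected values as if they were the state, which is itself a mean-field simplification.

```python
def mean_field_step(F, F_star, p, q):
    """Apply (3), (5) and (6) once; F[i-1] and F_star[i-1] refer to urn i."""
    n = len(F)
    pins = sum(i * F[i - 1] for i in range(1, n + 1))
    beta_k = (1 - p) / pins                              # normalising factor (4)
    new_F = [0.0] * (n + 1)
    new_F_star = list(F_star) + [0.0] * (n + 1 - len(F_star))
    for i in range(1, n + 2):
        cur = F[i - 1] if i <= n else 0.0
        if i == 1:
            gain = p                                     # boundary case (5)
        else:
            gain = beta_k * q * (i - 1) * F[i - 2]       # inflow from urn i-1 in (3)
        new_F[i - 1] = cur + gain - beta_k * i * cur
        new_F_star[i - 1] += (1 - q) * beta_k * i * cur  # starred urns (6)
    return new_F, new_F_star


F, F_star = [1.0], [0.0]
for _ in range(10):
    F, F_star = mean_field_step(F, F_star, p=0.3, q=0.9)
print(sum(F) + sum(F_star))
```

Summing (3), (5) and (6) over all urns shows that the expected total number of balls grows by exactly $p$ per stage, which gives a simple consistency check on the update.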



In order to solve the equations for the model, we make the assumption that, for large $k$,
the random variable $\beta_k$ can be approximated by a constant (i.e. non-random) value depending
only on $k$. We do this by replacing the denominator in the definition of $\beta_k$ by an asymptotic
approximation of its expectation. We observe that approximating $\beta_k$ in this way is essentially
a mean-field approach [BAJ99].

Let $\theta^{(k)}$ be the expected value of the average number of pins attached to a ball in a starred
urn at stage $k$. We have shown [FLL05a] that $\theta^{(k)}$ is bounded above by $1/(1 - q)$, so it is
reasonable to make the assumption that $\theta^{(k)}$ tends to a limiting value $\theta$ as $k$ tends to infinity.
It is easy to see that the total number of pins attached to the balls in the unstarred urns (i.e.
the active balls) at stage $k$ is asymptotically

$$\left( p + (1 - p)q - (1 - p)(1 - q)\theta \right) k + O(1).$$

Therefore, letting

$$\beta = \frac{1 - p}{p + (1 - p)q - (1 - p)(1 - q)\theta}, \tag{7}$$

we see that $k \beta_k$ tends to $\beta$ as $k$ tends to infinity.

If we now make the further assumption that

$$\theta^{(k)} = \theta + O(1/k),$$

then it is possible to show [FLL05a] that the expected value of $F_i(k)$ is asymptotically
proportional to $k$, i.e. $E(F_i(k))/k$ tends to a limit $f_i$ as $k$ tends to infinity. It similarly follows
that $E(F_i^*(k))/k$ tends to a limit $f_i^*$.

Following the derivation in [FLL05a], we obtain

$$f_i = \beta \left( q\,(i-1)\,f_{i-1} - i\,f_i \right), \tag{8}$$

for $i > 1$, and

$$f_1 = p - \beta f_1. \tag{9}$$

The solution of these equations is

$$f_i = \frac{C_\varrho\, \Gamma(i)\, q^i}{\Gamma(i + \varrho + 1)} \sim C_\varrho\, \frac{q^i}{i^{1+\varrho}}, \tag{10}$$

where $\varrho = 1/\beta$, $\Gamma$ is the gamma function [AS72, 6.1] and

$$C_\varrho = \frac{\varrho\, p\, \Gamma(1 + \varrho)}{q}.$$

The asymptotic approximation to $f_i$, i.e. (10), in the form corresponding to (2) is obtained
using Stirling's approximation [AS72, 6.1.39].
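
As a check on the algebra, the recurrence obtained by rearranging (8) and (9), namely $f_1 = p\varrho/(1+\varrho)$ and $f_i = q\,(i-1)\,f_{i-1}/(i+\varrho)$, can be compared numerically with the gamma-function form of the solution. The sketch below does this in Python; the function names are hypothetical, and the parameter values are simply those quoted later for the full MR data set, used here only as example inputs.

```python
import math


def f_recurrence(p, q, rho, n):
    """Return [f_1, ..., f_n] from (9) and (8), using beta = 1 / rho."""
    f = [p * rho / (1 + rho)]                      # (9): f_1 = p / (1 + beta)
    for i in range(2, n + 1):
        f.append(f[-1] * q * (i - 1) / (i + rho))  # (8) solved for f_i
    return f


def f_closed_form(p, q, rho, i):
    """Gamma-function form: (rho p Gamma(1+rho) / q) Gamma(i) q^i / Gamma(i+rho+1)."""
    log_c = math.log(rho * p / q) + math.lgamma(1 + rho)
    return math.exp(log_c + math.lgamma(i) + i * math.log(q)
                    - math.lgamma(i + rho + 1))


p, q, rho = 0.3351, 0.9658, 1.179   # example inputs only
values = f_recurrence(p, q, rho, 50)
worst = max(abs(values[i - 1] - f_closed_form(p, q, rho, i)) / values[i - 1]
            for i in range(1, 51))
print(worst)
```

Working with `math.lgamma` rather than `math.gamma` avoids overflow for large $i$.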

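For the starred urns, corresponding to (8) and (9), from (6) we have, for $i \geq 1$,

$$f_i^* = (1 - q)\,\beta\, i\, f_i. \tag{11}$$

Thus the ratio of active balls in $\mathit{urn}_i$ to inactive balls in $\mathit{urn}_i^*$ is

$$\frac{f_i}{f_i^*} = \frac{\varrho}{(1 - q)\, i}.$$

It follows that, for large $i$, the distribution of the balls is dominated by the contents of
the starred urns rather than the unstarred urns. Thus the distribution of the total number
of balls with $i$ pins is given by

$$f_i + f_i^* = \left( 1 + \frac{(1 - q)\, i}{\varrho} \right) f_i \sim C_\varrho\, \frac{q^i}{i^{1+\varrho}} \left( 1 + \frac{(1 - q)\, i}{\varrho} \right), \tag{12}$$

where $C_\varrho$ is the constant appearing in (10).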



In the following section we will make use of the equation

$$(1 - p)(1 + \varrho) = p\, F(1, 2; 2 + \varrho; q), \tag{13}$$

where $F$ is the hypergeometric function [AS72, 15.1.1]. This can be derived by using (10)
to obtain the sum of $i f_i$ for $i \geq 1$; this is just the asymptotic value of the total number of
pins attached to the balls in the unstarred urns divided by $k$. However, from (7) and the
discussion preceding it, this sum is also equal to $(1 - p)/\beta$, i.e. $\varrho(1 - p)$, see [FLL05a] for
further details.
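
Equation (13) can be solved for $p$ once $\varrho$ and $q$ are known, since it rearranges to $p = (1+\varrho)/(1+\varrho+F(1,2;2+\varrho;q))$. The sketch below evaluates the hypergeometric function by summing its Gauss series directly, which converges for $|q| < 1$. The function names and the stopping tolerance are assumptions of this example, and a library routine could be used instead.

```python
import math


def hyp2f1_1_2(c, z, tol=1e-15, max_terms=1_000_000):
    """Gauss series for F(1, 2; c; z), valid for |z| < 1."""
    term, total, j = 1.0, 1.0, 0
    while abs(term) > tol and j < max_terms:
        term *= (2 + j) / (c + j) * z   # ratio of consecutive series terms
        total += term
        j += 1
    return total


def p_from_rho_q(rho, q):
    """Solve (13) for p given rho and q."""
    big_f = hyp2f1_1_2(2 + rho, q)
    return (1 + rho) / (1 + rho + big_f)


# Sanity check against the elementary case F(1, 2; 3; z) = 2(-ln(1 - z) - z)/z^2.
print(hyp2f1_1_2(3.0, 0.5), 8 * (math.log(2) - 0.5))
print(p_from_rho_q(1.179, 0.9658))
```

The series converges slowly when $q$ is close to one, because its terms decay roughly like $q^j j^{-\varrho}$; for the values of $q$ considered here around a thousand terms are needed.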



3 Collaboration Networks 

As a proof of concept we will consider the Mathematical Research (MR) collaboration network 
for which an exponential cutoff has been reported [Gro02]. In our model it is assumed that an
author enters the network with a single collaboration, which could be interpreted as a
"self-collaboration". Thereafter, each time an author acquires a new collaborator the corresponding
ball is moved along to the next urn with an additional pin attached to it. There is also a 
certain probability that an author becomes inactive. Authors who are no longer active still 
remain part of the network, although they will not be involved in any new collaborations. 

We note that collaboration networks, together with some other types of network, like 
protein and actor networks, are essentially undirected. So in our model a new collaboration 
between two authors should be represented by two separate events, one for each author. This 
would correspond to taking in pairs the events of attaching a pin to a ball. We ignore this 
complication, but note that many of the models proposed, for example for the web graph, 
similarly ignore the difference between directed and undirected graphs (e.g. [BA99]).

We now examine in detail the degree distribution of the MR collaboration network. The
data for this was supplied to us by Jerry Grossman at the Department of Mathematics and
Statistics at Oakland University, Rochester [Gro02]. In order to derive the values for $\varrho$ and
$q$, we performed a nonlinear regression on a log-log transformation of the degree distribution
of the MR collaboration network to fit the equation

$$y = a - \varrho x + \exp(x) \ln q + \ln\left( \exp(-x) + (1 - q)/\varrho \right), \tag{14}$$

corresponding to (12), where $a$ is a constant.
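
The correspondence between (14) and (12) can be verified numerically by substituting $x = \ln i$ and comparing with the logarithm of the asymptotic form in (12). The sketch below does this; the function names are hypothetical and the parameter values are arbitrary example inputs. The fit itself would require a nonlinear least-squares routine, which is not shown here.

```python
import math


def log_model(x, a, rho, q):
    """Right-hand side of (14), with x the log of the number of pins."""
    return (a - rho * x + math.exp(x) * math.log(q)
            + math.log(math.exp(-x) + (1 - q) / rho))


def log_total(i, a, rho, q):
    """Log of the asymptotic form in (12), with constant factor exp(a)."""
    return (a + i * math.log(q) - (1 + rho) * math.log(i)
            + math.log(1 + (1 - q) * i / rho))


a, rho, q = 0.0, 1.2, 0.97   # example inputs only
for i in (1, 10, 100):
    print(i, log_model(math.log(i), a, rho, q), log_total(i, a, rho, q))
```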

The results are shown in Figure 1. The values of $\varrho$ and $q$ obtained from the regression of
the complete MR data set (129 points) are $\varrho = 1.179$ and $q = 0.9658$.



[Figure 1: Mathematical Research collaboration data. Series shown: MR collaboration data; regression curve, all 129 points; regression curve, first 91 points. Horizontal axis: log of number of collaborators.]

We next performed a stochastic simulation to test the validity of our model with respect to 
the results of the regression on the original data set. In order to use this data for a stochastic 
simulation of our model, we require the values for $p$ and $k$, using $\varrho$ and $q$ computed from the
above regression. 

We calculated a value for $k$ to use in simulating our stochastic model from:

$$\frac{\mathit{balls}_k + \mathit{balls}_k^*}{k} \approx p. \tag{15}$$

This follows from the formulation of the model, where $\mathit{balls}_k$ and $\mathit{balls}_k^*$ stand for the
expected numbers of balls at stage $k$ in the unstarred and starred urns, respectively. The
right-hand side of (15) is the limiting value of the left-hand side as $k$ tends to infinity.

Similarly, from the formulation of the model, we have

$$\frac{\mathit{pins}_k + \mathit{pins}_k^*}{k} \approx 1 - (1 - p)(1 - q). \tag{16}$$

On using (15) and (16), we obtain an alternative equation for $p$, given by

$$p = \frac{q}{\mathit{bf} + q - 1}, \tag{17}$$

where $\mathit{bf} = (\mathit{pins}_k + \mathit{pins}_k^*) / (\mathit{balls}_k + \mathit{balls}_k^*)$ is the branching factor.

From the data we see that the total number of researchers was 253339, and the total
number of collaborations was 992978. Using these values for $\mathit{balls}_k + \mathit{balls}_k^*$ and $\mathit{pins}_k + \mathit{pins}_k^*$,
respectively, gives us the branching factor for the original data set as $\mathit{bf} = 3.9196$.

We can now obtain the values of $p$ and $k$. Computing $p$ from (13) gives $p = 0.3351$ and
from (17) gives $p = 0.2486$. Using the first value of $p$ we obtain the alternative values for $k$
from (15) or (16) as $k = 756010$ or $k = 1016083$, respectively, and using the second value of
$p$ gives us a value of $k = 1019170$ from either (15) or (16).
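
The arithmetic behind these values can be retraced with a few lines of Python. This is a sketch with hypothetical function names; it takes $q = 0.9658$ and $p = 0.3351$ as given and recomputes the remaining quantities from (15), (16) and (17), which agree with the figures quoted above up to rounding.

```python
def branching_factor(balls, pins):
    """bf = total pins / total balls."""
    return pins / balls


def p_from_bf(bf, q):
    """Equation (17)."""
    return q / (bf + q - 1)


def k_from_balls(balls, p):
    """Equation (15) solved for k."""
    return balls / p


def k_from_pins(pins, p, q):
    """Equation (16) solved for k."""
    return pins / (1 - (1 - p) * (1 - q))


balls, pins, q = 253339, 992978, 0.9658
bf = branching_factor(balls, pins)
p17 = p_from_bf(bf, q)
print(round(bf, 4), round(p17, 4))
print(round(k_from_balls(balls, 0.3351)), round(k_from_pins(pins, 0.3351, q)))
print(round(k_from_balls(balls, p17)), round(k_from_pins(pins, p17, q)))
```

When $p$ is taken from (17), the two estimates of $k$ coincide by construction, because (17) is obtained by dividing (16) by (15).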

We then carried out a batch of 10 simulation runs of the stochastic process for each of the three
combinations of the values of $p$, $q$ and $k$. The results from the three batches are shown in
Table 1. For each batch we report the average output values for $\mathit{balls}_k + \mathit{balls}_k^*$, $\mathit{pins}_k + \mathit{pins}_k^*$
and $\varrho$. As a further validation of our methodology, we computed the average number of balls
in each urn for each of the three batches, and performed a nonlinear regression, taking into
account all urns until an empty one was encountered. The values of $q$ and $\varrho$ obtained from
this regression are shown in the row following the results for each batch.

For the first batch, it can be seen that the values of $\mathit{balls}_k + \mathit{balls}_k^*$ and $\varrho$ are consistent with
the data but the value of $\mathit{pins}_k + \mathit{pins}_k^*$ is less consistent, since, in this case, we computed
$k$ from (15). For the second batch, it can be seen that the values of $\mathit{pins}_k + \mathit{pins}_k^*$ and $\varrho$
are consistent with the data but the value of $\mathit{balls}_k + \mathit{balls}_k^*$ is less consistent, since, in this
case, we computed $k$ from (16). Finally, for the third batch, it can be seen that the values
of $\mathit{balls}_k + \mathit{balls}_k^*$ and $\mathit{pins}_k + \mathit{pins}_k^*$ are consistent with the data but the value of $\varrho$ is less
consistent, since $p$, computed from (17), is less constrained than when it is computed from
(13), which takes $\varrho$ into account. It is also evident that the value of $\varrho$ computed from the nonlinear
regression on the urn values from the simulation is, for all batches, below the value predicted
from the simulation.



Simulation    q        p        k         balls_k + balls_k^*   pins_k + pins_k^*   $\varrho$
Data          0.9658   -        -         253339.0              992978.0            1.1790
Batch 1       -        0.3351   756010    253343.5              738836.6            1.1786
Regression    0.9625   -        -         -                     -                   1.0270
Batch 2       -        0.3351   1016083   340592.2              993041.2            1.1795
Regression    0.9641   -        -         -                     -                   1.0530
Batch 3       -        0.2486   1019170   254116.5              993518.4            0.9055
Regression    0.9640   -        -         -                     -                   0.8181

Table 1: Summary of simulations for parameters derived from the full MR data set

We observe that there are problems in fitting power-law type distributions, due to
difficulties with non-monotonic fluctuations in the tail. (Another reason may be the sensitivity of
the nonlinear regression to the cutoff parameter $q$.) In particular, the presence of gaps in the
distribution of balls in the urns is the main manifestation of this problem. There is a gap
in this distribution at $\mathit{urn}_i$ if there are no balls in $\mathit{urn}_i$ but there exists at least one ball in
$\mathit{urn}_j$, where $j > i$. We discussed this problem more fully in the context of a pure power-law
distribution in [FLL05b], and concluded that a preferable approach is to ignore all data points
from the first gap onwards. Evidence of the advantage of discarding data points in the tail of
the distribution was also given in [GMY04], where the more radical approach of using only
the first five data points is suggested. In the MR data set the first gap occurs at $i = 92$.
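
The truncation rule described above (ignore all data points from the first gap onwards) can be sketched as follows. The function names and the dictionary representation, mapping the number of pins $i$ to the number of balls in $\mathit{urn}_i$, are assumptions of this example.

```python
def first_gap(counts):
    """Return the smallest empty urn index below the largest occupied one, or None."""
    occupied = {i for i, n in counts.items() if n > 0}
    if not occupied:
        return None
    for i in range(1, max(occupied)):
        if i not in occupied:
            return i
    return None


def truncate_at_first_gap(counts):
    """Keep only the occupied urns that come before the first gap."""
    gap = first_gap(counts)
    return {i: n for i, n in counts.items()
            if n > 0 and (gap is None or i < gap)}


example = {1: 10, 2: 5, 3: 0, 4: 1}   # toy data, not the MR data set
print(first_gap(example), truncate_at_first_gap(example))
```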

As a further test of the validity of the model, we created a truncated data set by keeping 
only the first 91 data points of the MR data set. The regression curve, for the first 91 points 
in the data set, is also shown in Figure 1, where the values for $\varrho$ and $q$ obtained from the
regression are $\varrho = 0.8347$ and $q = 0.9438$.

Using these values for $\varrho$ and $q$ we obtained alternative values for $p$ and $k$. Computing
$p$ from (13) gives $p = 0.2650$ and from (17) gives $p = 0.2443$. Using the value $p = 0.2650$
we obtain, from (15) or (16), the alternative values for $k$ as $k = 955996$ or $k = 1035762$,
respectively, and using the value $p = 0.2443$ gives us a value of $k = 1037021$ from either (15)
or (16).

We then carried out a further batch of 10 simulation runs of the stochastic process for each of
the three combinations of the values of $p$, $q$ and $k$, derived from the truncated data set. The
results from these further three batches are shown in Table 2.

For the first batch, it can be seen that the values of $\mathit{balls}_k + \mathit{balls}_k^*$ and $\varrho$ are consistent
with the data but the value of $\mathit{pins}_k + \mathit{pins}_k^*$ is less consistent, although it is closer to its
computed value compared to the value 738836.6 obtained from the previous simulations on
the full MR data set. For the second batch, it can be seen that the values of $\mathit{pins}_k + \mathit{pins}_k^*$
and $\varrho$ are consistent with the data but the value of $\mathit{balls}_k + \mathit{balls}_k^*$ is less consistent, although
it is much closer to its computed value compared to the value 340592.2 obtained from the
previous simulations on the full data set. Finally, for the third batch, it can be seen that the
values of $\mathit{balls}_k + \mathit{balls}_k^*$ and $\mathit{pins}_k + \mathit{pins}_k^*$ are consistent with the data but the value of $\varrho$
is less consistent, although it is much closer to its computed value 0.8347 compared to the
value 0.9055 obtained from the previous simulations on the full data set; the latter is further
away from 1.179. As for the full data set, it is also evident that the value of $\varrho$ computed from the
nonlinear regression on the urn values from the simulation is, for all batches, below the value
predicted from the simulation.

Overall the results show that the data is consistent with our model, and that the results 
of the simulations better match the truncated data set. It is important to note that small 
variations in q obtained from the nonlinear regression may result in relatively large variations 
in the regressed value of $\varrho$.

Simulation    q        p        k         balls_k + balls_k^*   pins_k + pins_k^*   $\varrho$
Data          0.9438   -        -         253339.0              992978.0            0.8347
Batch 1       -        0.2650   955996    253585.3              916594.8            0.8353
Regression    0.9402   -        -         -                     -                   0.7273
Batch 2       -        0.2650   1035762   274969.4              993029.5            0.8358
Regression    0.9468   -        -         -                     -                   0.8122
Batch 3       -        0.2433   1037021   253840.0              993041.8            0.7681
Regression    0.9428   -        -         -                     -                   0.6975

Table 2: Summary of simulations for parameters derived from the truncated MR data set

4 Concluding Remarks

We have presented an extension of Simon's classical stochastic process, which results in a
power-law distribution with an exponential cutoff. When viewing the stochastic process in
terms of an urn transfer model, the difference from the classical process is that, after a ball is
chosen on the basis of preferential attachment, with probability $1 - q$ the ball becomes inactive.
By following the mean-field approach, we derived the asymptotic formula (12), which shows
that the distribution of the number of balls in the urns approximately follows a power-law
distribution with an exponential cutoff.

Exponential cutoffs have been identified in protein, e-mail, actor and collaboration
networks, and possibly in the web graph [MBSA02]; it is likely that exponential cutoffs also
occur in other complex networks. Here we have dealt with networks such as collaboration
and actor networks, where preferentially chosen authors/actors may become inactive; in a
previous paper ([FLL05a]) we have dealt with networks such as protein and e-mail networks,
where preferentially chosen proteins/e-mail accounts may be discarded from the network. We
demonstrated the applicability of our model using data from the Mathematical Research
collaboration network, thus showing that our model offers a plausible explanation for certain
processes that give rise to a power-law distribution with an exponential cutoff.